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Abstract 

We propose a novel sparse tensor decomposition method, namely Tensor Truncated 
Power (TTP) method, that incorporates variable selection into the estimation of decom¬ 
position components. The sparsity is achieved via an efficient truncation step embedded 
in the tensor power iteration. Our method applies to a broad family of high dimensional 
latent variable models, including high dimensional Gaussian mixture and mixtures of 
sparse regressions. A thorough theoretical investigation is further conducted. In partic¬ 
ular, we show that the final decomposition estimator is guaranteed to achieve a local 
statistical rate, and further strengthen it to the global statistical rate by introducing a 
proper initialization procedure. In high dimensional regimes, the obtained statistical rate 
significantly improves those shown in the existing non-sparse decomposition methods. 
The empirical advantages of TTP are confirmed in extensive simulated results and two 
real applications of click-through rate prediction and high-dimensional gene clustering. 


1 Introduction 

Tensor as a multi-dimensional generalization of matrix has been widely used in industry, e.g., Carroll 
and Chang (1970); Bro and Kiers (2003), and is being actively studied in the community of machine 
learning (Karatzoglou et ah, 2010; Rendle and Schmidt-Thieme, 2010; Zheng et ah, 2010; Chi and 
Kolda, 2012; Liu et ah, 2013) and statistics (Zhou et ah, 2013; Yang and Dunson, 2013; Yuan 
and Zhang, 2014). In particular, significant progress has been made toward tensor decomposition 
(Signoretto et ah, 2014; Anandkumar et ah, 2014b), which has shown great success in personalized 
recommendation. Traditional recommendation systems are mainly based on the user-item matrix, 
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whose entry represents each user’s behavior on a particular item. To incorporate the contextual 
information into the analysis, we need to consider a user-item-context tensor. For example, in online 
advertising, users’ click behaviors on a jacket advertisement in the winter could be very different 
from those in the summer. In this example, the goal of recommendation systems is to recommend 
to a user the most suitable advertisement in different temporal circumstances such that this user 
has a large chance to click it. We refer readers to Kolda and Bader (2009a) for a thorough overview 
of tensor decompositions and their applications. 

As far as we are aware, most existing tensor decomposition results are established in the non- 
sparse regime where the decomposition components include all features. In many applications, the 
tensor contains many zeros and the decomposition components are very sparse. For example, in the 
online advertising example, the probability of a click event is very tiny and the observed user-item- 
context tensor contains many zeros, i.e., no click. In addition, in high dimensional tensor cases, many 
features in the components essentially contain no information about the tensor structure, and thus 
the performance of tensor structure recovery may not be desirable. Moreover, the interpretability of 
tensor decomposition is inevitably impeded by including non-relevant features. At last, in many 
applications, it is important to identify which features are crucial. It has been shown that tensor 
decomposition is applicable for clustering (Anandkumar et ah, 2014b). In high-dimensional gene 
clustering problem, recognizing the relevant genes for clustering is of great interest for scientific 
discovery. Hence, a more appropriate method that can simultaneously perform tensor decomposition 
and select informative features is in need. 

In this paper, we propose a new sparse tensor decomposition method called Tensor Truncated 
Power (TTP) that encourages the sparsity of each decomposition component by incorporating a 
truncation step into the tensor power iteration step. Specifically, in each iteration, the decomposition 
components are first updated via a tensor power method (Lathauwer et ah, 2000; Anandkumar 
et ah, 2014c), and then truncated to only preserve the entries of s largest magnitudes. Here 
the parameter s is much smaller than the maximal dimension d in all modes, and can be tuned 
in a data-driven manner. This truncation step efficiently imposes the desirable sparsity of the 
decomposition components, and hence significantly improves the statistical rate as shown later. 
Moreover, we provide a provable sparse SVD initialization procedure, which can eventually lead to 
the global statistical rate of our final estimator. 

In theory, we establish both local and global statistical rates of the proposed method. In 
particular, for an observed noisy tensor with a perturbation error £, we denote its sparse norm as 
r](£,dQ) with do the maximum number of nonzero entries in the decomposition components of the 
true tensor; see (3.1) for details. Let s > do be the maximum number of nonzero elements of the 
estimated decomposition components in our algorithm and let the rank (defined in (2.2)) of the 
true tensor be K. Given an appropriate initialization with an estimation error cq, our TTP method 
requires only O(log(eo/eij)) steps to achieve the desirable statistical rate e^, where^ 


= o( v{£,do + s) -F VK/do 


Sample Error Model Error 


^For two sequence we say a„ = 0{bn) if there exists some positive constant Co and sufficiently large n snch 

that a„ < Cob„, on the other hand, we say a„ = fl{b„) if bn — 0(a„). 
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Figure 1: An illustration of our theoretical analysis. The global statistical rate is shown by first 
proving that the sparse SVD (SSVD) initialization produces an estimator in the basin of attraction, 
and then carefully quantifying the estimation error at each update step. 

This statistical rate can be decomposed into two parts, where the sample error captures the noise 
level of the observed tensor and the model error measures the complexity in decomposing the 
true tensor. In high-dimensional regimes, this rate significantly improves those developed in the 
non-sparse tensor decomposition methods (Anandkumar et ah, 2014c), see Section 3.1 for detailed 
discussions. Figure 1 illustrates our theoretical analysis. When the initialization is not far away from 
the true parameters, i.e., the so-called “basin of attraction”, then our TTP estimator is guaranteed 
to continuously move toward the convergence region around the true parameters. 

A by-product of our TTP method is to solve high-dimensional latent variable models. In 
order to illustrate the applicability, we employ our method to two statistical problems: high¬ 
dimensional clustering (Hsu and Kakade, 2013) and mixture of sparse regressions (Chaganty and 
Liang, 2013). Finally, extensive experiments are implemented to backup the theoretical developments 
and demonstrate the superior performance of our procedure. 

1.1 Comparison with Related Work 

A related work on tensor decomposition is the robust tensor decomposition method with non-sparse 
decomposition components proposed by Anandkumar et al. (2014c). Under certain conditions, they 
prove that their method is able to recover the decomposition with an error rate 0{r]{£,d) -f y/K/d) 
when the dimensions of all decomposition components equal to d. In the high-dimensional regimes 
where do d, the above error is dominated by the sample error ri{£,d), which is significantly larger 
than our sample error r}{£, do + s). In order to address high dimensionality, one key ingredient of our 
method is a new truncation step built upon their robust tensor decomposition method to encourage 
the sparsity structure of the decomposition components. This additional truncation step demands 
more challenging technical analysis than those in Anandkumar et al. (2014c). In particular, we need 
to carefully characterize the impact of the intermediate sparse update on the estimation error in 
each iteration step. See Section 2.3 for more discussions. 

Another line of research focuses on the convex relaxation of the low rank tensor decomposition 
and completion problem. For example, the convex relaxation is achieved by a generalized trace 
norm (Romera-Paredes and Pontil, 2013), a tensor Schatten 1-norm (Liu et al., 2014; Gu et ah. 
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2014), or a novel tensor nuclear norm (Yuan and Zhang, 2014). Kolda and Bader (2009b); Plantenga 
et al. (2015) and Kolda (2015) propose a straightforward optimization method to the symmetric 
real-valued or nonnegative tensor decomposition. However, all these works mainly focus on low rank 
tensor recovery and their decomposition components are generally non-sparse. 

In sparse tensor decomposition, Morup et al. (2008) develops a sparse non-negative Tucker 
decomposition approach by matricizing the tensor and employing the norm penalty as the 
sparsity constraint. However, as commented in Allen (2012), in the high-dimensional case, sparse 
optimization after matrization is computationally intensive and requires a large amount of computer 
memory. Instead, Allen (2012) directly adds an penalty on the decomposition vectors in the 
rank-1 best approximation optimization problem and solves it via alternative soft thresholding 
update, while Liu et al. (2012) suggests to solve the sparse non-negative tensor factorization via a 
coordinate descent method. Although these sparse procedures show good empirical performance in 
variable selection, no theoretical analysis on these sparse estimators is available. Chi and Kolda 
(2012) propose a nonnegative decomposition algorithm for tensors whose entries are sparse and 
their theoretical analysis requires that the tensors are generated from a Poisson distribution. To 
the best of our knowledge, our TTP method is the first sparse tensor decomposition method with 
guaranteed local and global statistical rates. 

1.2 Notation 

The following notation is adopted throughout this paper. Denote [d] = {1,... ,d}. For a matrix 
A = (ojj) G we denote ||A|| as its spectral norm. We define Aj as the j-th column and PJ as 

the j-ih. row of A. Furthermore, A\j G is a submatrix of A with its j-th column removed, 

and A''-^ G is A with its j-th row removed. We denote the dxd identity matrix as or I 

when no confusion arises. For a vector v = (vi,..., v^)"*^ G ||v|| refers to its Euclidean norm 
and ||v||o denotes the number of nonzero entries of v. For an index set X C [d], we define vj as the 
vector whose i-th entry is equal to Vj if z G X, and zero otherwise. Let supp(v) be the index set of 
its nonzero entries. Let 1(A) be the indicator function which equals 1 when the event A is true and 
0 otherwise. We denote o to be the outer product between vectors. Throughout this paper, we use 
Co, Cl,... to denote generic absolute constants, whose values may vary from line to line. 

1.3 Paper Organization 

The rest of the article is organized as follows. Section 2 introduces our sparse tensor decomposition 
method TTP and its implementation. The local and global statistical rates of the proposed TTP 
method are established in Section 3. Section 4 describes practical selections of tuning parameters. 
Section 5 presents the simulation results, followed by a real application in Section 6. The online 
supplementary material explains affiliated steps in our main algorithm, discusses applications of 
the TTP method to high-dimensional latent variable models, and includes additional experimental 
results and all technical proofs for the theoretical developments. 
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2 Sparse Tensor Decomposition 


In this section, we introduce some preliminary background on tensors, and then propose our sparse 
tensor decomposition method as well as its efficient implementation. 


2.1 Preliminary 


Consider a third-order tensor T G ]^<^ix'^ 2 xd 3 _ adopt the tensor notation in the review article 
by Kolda and Bader (2009a). In tensors, a fiber refers to a higher order analogue of matrix row 
and column. A fiber is obtained by fixing all but one of the indices of the tensor. For a third-order 
tensor T, the mode -1 fiber is given by mode -2 fiber by and mode-3 fiber by [7~]ij,:. 

We similarly define a slice of a tensor by fixing all but two of the indices. For instance, the slice 
along mode -1 is given as [T]i, 

We next define the vector product of a tensor. For a vector G with k = 1,2,3 and a 
tensor T G ]K'^ixrf 2 xd 3 ^ define the mode-1, mode-2, and mode-3 vector product as. 


r := 






T X 2 := 

je[d2] 




r X3 u(3) := 


te[rf3] 




which are the multilinear combinations of the tensor slices. We also define the multilinear combination 
of the tensor mode -1 fibers and the multilinear combination of the tensor entries as 


T X 2 X 3 := 


(3) 


Euf>uf> 

3 / 


T Xi X2 X3 := 


(1) (2) (3) 
u:- mi ^ 


j 


'm 


i,j/- 




Similar definitions apply to T Xi 
norm of a tensor T as ||T|| := sup||u|| = 


\\T\\f := 



X 2 and T Xi X 3 We define the spectral 

|v||=||w||=i |T X 1 UX 2 VX 3 WI and its Frobenius norm as 


2.2 Sparse Tensor Decomposition Method 

Tensor decomposition aims to express a tensor as the sum of rank one tensors. Specifically, a tensor 
7 " £ ^dixd 2 xd 3 jg have a rank K if it can be written as the sum of K rank-1 tensors, that is 

r = E^e WiSii o bj o Cj, where tCj G M and a* G bj G c, G Here, we assume a*, bj, Cj 
to be unit vectors, since otherwise the normalized terms can be incorporated in the coefficient 
Wi- Given an observed tensor 7~, which can be written as T = T + £ with T being the true 
tensor and £ being the error tensor, we can recover its low rank decomposition by minimizing 
\\f-E^e WiSii o bi o Cj||i? subject to constraints on Wi,ai,hi, and Cj (Carroll and Chang, 1970; 
Bro and Kiers, 2003). 

In the simplest case where K = 1, the single-factor tensor decomposition solves ||T —rcaoboc||i? 
subject to ||a|| = ||b|| = ||c|| = 1 and tc > 0, whose solution is given by Allen ( 2012 ) as, 

a = Norm(T X 2 b X 3 c), b = Norm(T Xi a X 3 c), c = Norm(T Xi a X 2 b), (2.1) 

where Norm(v) = v/||v|| is a normalization operator on a vector v. This procedure provides an 
iterative coordinate update procedure for the single-factor tensor decomposition. To compute all the 


5 



K decomposition components, one can apply this single-factor procedure sequentially to the residual 
tensor left after subtracting previously recovered ones. The single-factor tensor decomposition 
procedure incorporating this deflation strategy is known as the tensor power method (Kolda and 
Bader, 2009a; Anandkumar et ah, 2014c), which is efficient for non-sparse tensor decomposition. 

In this paper, we consider a model of sparse and low-rank tensor decomposition. We assume 
that T G 'g^d.ixd 2 xd 3 gparse and has rank K such that 

T= ^ rcja* o bj o Cj, rcj G M, a* G bj G c* G (2.2) 

ie[K] 

where S'^“^(]R) = {v G | ||v|| = 1} and ||aj||o < doij ||t)i||o < do 2 ) ||cj||o < do 2 for any i G [K], 
Moreover, we assume zumax = wi > ■ ■ ■ > wk = w^min > 0 and assume each Wi to be bounded away 
from 0 and oo. 

In order to recover the sparse and low-rank decomposition components, our sparse tensor 
decomposition method attaches the following truncation step to each tensor power update. To 
introduce the truncation step, we first define two relevant operations: for a vector v G and an 
index set T C [d], we define Truncate(v, T) as 

r , M f Vj if f G T 

[Truncate(v, T)]i = < , 

[O, otherwise 

and for a scaler s < d, we denote Truncate(v, s) = Truncate(v, supp(v, s)), where supp(v, s) refers 
to the set of indices of v corresponding to its largest s absolute values. Building upon the tensor 
power update in (2.1), our sparse tensor decomposition method updates the sparse components via 

a = Truncate(a, si), b = Truncate(b, S 2 ), c = Truncate(c, S3), (2.3) 

where si < di,S 2 < d 2 ,S 3 < d^ are the corresponding sparsity levels. Then the components are 
normalized in order to satisfy the unit norm constraint. We refer our method as Tensor Truncated 
Power (TTP) method. The detailed algorithm is proposed in Section 2.3. As will be shown later, 
this truncation step achieves desirable variable selection effect and improves the tensor structure 
recovery performance in the high-dimensional settings. 

Note that the above truncation idea has recently shown to be successful in a wide context of 
high dimensional problems. For example. Yuan and Zhang (2013) employ it to extract the largest 
sparse eigenvectors of a high dimensional matrix. Wang et al. (2014) apply this strategy to the EM 
algorithm to produce a sparse solution in high dimensional cases. It is worth noting that our paper 
is the first one to incorporate this truncation strategy into the tensor decomposition problem. 

Our sparse tensor decomposition method is applicable to solve the low rank tensor approximation 
problem which approximates an observed tensor T G ^ gparse and low rank tensor in 

(2.2). This is a generalization of penalized matrix decomposition considered in Witten et al. (2009). 
The sparse tensor approximation aims to minimize \\T — ° bj o Cj||i? subject to the 

constraints that Wi G M"*", a* G G ^ ^nd ||aj||o < doi; ||bi||o < ^ 02 , ||cj||o < 

do 2 - However, solving the optimization problem directly is computationally challenging. We can 
apply our TTP method to find an approximation to the solution of this optimization problem. 
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2.3 Algorithm 

In Algorithm 1 below, we present more implementation details of the proposed TTP method. 

Algorithm 1 TTP Method for Sparse Tensor Decomposition 
1 ; Input: tensor T G 'g^dixd 2 xd 3 ^ number of initializations L, number of iterations N, cardinality 
vector (si,S 2 ,S 3 ), rank K. 

2 : For r = 1 to L Do 

3: Initialize unit vectors G G G 

4: For t = 1 to A: Alternatively update the components a^ \ \ as 

= Norm^T X2 

= Norm^T Xia![*“^^ 

pW _ Norm^T Xi X2 

5: End For 

6 : End For 

7; Output: the cluster centers {aj,hj,Cj), j G 

K clusters and their corresponding Wj = T x i a^ x 2 hj x 3 Cj. 


_ Truncate(a^*\ Si); a® = Norm(a[*^), (2.4) 
_ Truncate(b!^*\ S 2 ); b^*^ = Norm(b![.*^), (2.5) 
_ Truncate S3); = Norm(c^*^). (2.6) 

[K] by clustering | (ai^\ br^\ ), r G [T]| into 


The key step of our TTP procedure is the truncated power updates in (2.4)-(2.6). In each iteration 
t of the inner loop, when updating the first decomposition component in (2.4), we first compute the 
non-sparse component slt'^ via classical tensor power method in ( 2 . 1 ), then the truncation step keeps 
only the top si entries in a^*^ and sets the rest entries as zero. Finally, the truncated component 
is normalized to a^ ^ which ends a full update cycle of the first decomposition component. Our 
TTP method updates all three components alternatively until a pre-specified maximal number of 
iterations is achieved. This termination condition can be modihed to the case when the changes of 
components are below some thresholding value, see Section 4.1 for more details. 

For each initialization r, the procedure runs N iterations of updates in (2.4)-(2.6) to generate the 
converged decomposition components. These procedures are repeated for L different initializations, 
where the number L is a polynomial function of K. A theoretical lower bound of L will be given in 
Section 3 and a practical choice of L will be provided in Section 4. In the algorithm, we suggest two 
initialization procedures, one is a sparse SVD initialization and another is a random initialization. 
The former has a nice theoretical guarantee, while the latter is simple and fast in practice. The 
detailed initialization procedures are discussed in Section S.1.1 in the supplementary. In the last step, 
the algorithm clusters the L tuples of br^\ ) into K clusters to output all components. 

This clustering step will be discussed in details in Section S.1.2 in the supplementary. In practice, 
the parameters si, S2, S3, and K can be determined via a data-driven tuning procedure and will be 
discussed in details in Section 4. 

It’s worth mentioning that our Algorithm 1 is built based on the non-sparse tensor decomposition 
algorithm proposed by Anandkumar et al. (2014c). In order to address the high dimensionality. 
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one key ingredient of our method is a new truncation step in (2.4)-(2.6) to encourage the sparsity 
structure of the decomposition components. Despite its simplicity, this additional truncation step 
demands more challenging technical analysis than those in Anandkumar et al. (2014c). In particular, 
we need to carefully characterize the impact of the intermediate sparse update on the estimation 
error in each iteration step. As we will show in the next Section, such truncation step leads to 
a much improved statistical rate in high-dimensional tensor decomposition scenario. Moreover, 
different from the sparse decomposition procedure considered in Allen ( 2012 ) which imposes ii 
penalty on each component, our TTP method encourages the sparsity via a direct (.q cardinality 
constraint. Although the penalized decomposition procedure in Allen ( 2012 ) shows good empirical 
performance in variable selection, no theoretical analysis of its performance was available. To the 
best of our knowledge, our TTP method is the first sparse tensor decomposition method with 
guaranteed both local and global statistical rates. 


3 Theoretical Analysis 


This section establishes both local and global analysis of our TTP method. Specifically, to show the 
local statistical rate, we require an incoherence condition on the true tensor, a constraint on the 
perturbation error, as well as an appropriate initialization; to show the global statistical rate, we 
then employ the local analysis coupled with the error analysis of the newly introduced sparse SVD 
initialization procedure. 

Recall that we focus on the sparse and low-rank tensor decomposition of the true tensor 
T G ]^'iiX'^2xd3 g^g in (2.2). In practice, we usually observe a perturbed tensor T based on 

limited samples. That is, given the perturbation error T, the observed tensor 7~ can be written as 
T = T + £. To quantify the noise level of the error, we define the sparse spectral norm of £ as 


'ni^:doi,do2, dos) ■— 


sup 

||u|| = ||v|| = ||w||=l 
|u||o<<ioi,||v||o<rfo2,||w||o<cio3 


TX1UX2VX3W 


(3.1) 


Here r/(T, doi, do 2 , dos) quantifies the perturbation error in a sparse scenario, and in the sparse tensor 
decomposition case with doi di, do 2 ^ 2 , do 3 ^ ^ 3 , it is much smaller than the spectral norm 
||T||, which equals rj{£,di,d 2 ,d^). Denote do = iiiax{doi, do 2 , do 3 }, we have ??(T, doi, do 2 , do 3 ) < 
rj{£, do, do, do) and for simplicity we denote r/(T, do) := r]{£, do, do, do). 

Our theoretical analysis studies the sufficient conditions under which the estimated components 
(aj,hj,Cj) from our TTP method converge to the truth (aj,hj,Cj) for any j G [K], Moreover, 
our analysis quantifies its specific statistical rate. In order to compute the distance between the 
estimator and the truth, we define the distance measure between two unit vectors u, v G as 

D(u,v) := yi-(uTv)2. (3.2) 

For unit vectors u, v, we have D(u, v) < min{||u — v||, ||u-|-v||} < \/2T)(u, v). The distance function 
D(u,v) resolves the sign issue in the decomposition components since changing the signs of any two 
components vectors while fixing the third component vector will not affect the generated tensor. 





Before presenting the main theorems, we introduce assumptions on the identifiability and 
incoherence on the true tensor 7~, which play an important role in our subsequent theoretical 
analysis. 

Assumption 3.1 (Identifiability). The tensor decomposition of T in (2.2) is unique in the sense 
that if the tensor has another decomposition T = Xlie [K'] w'i^i ° b' o c' with a' G b' G 

grf 2 -i^ c' G and w[ ^ 0, we have K = K' and there must exist a permutation a of {1,..., A} 

such that w' r\ = Wi, sl = a*, b' = bj and c' ,.s = m. 

Assumption 3.2 (Incoherence). The decomposition components are incoherent such that 

C :=max{|(ai,aj)|,|(bi,bj)|,|(ci,Cj)|} < (3.3) 

Vdo 

with do = max{(ioi, do 2 , doa} and for any j, || Wi{ai, aj)(bj, bj)cj|| < CiWma.xVKC. Moreover, 
matrices A ;= [ai, • • • , ax]-, B := [bi, • • • , b/^], and C ;= [ci, • • • , ck] satisfy max{|| A||, ||B||, ||C||} < 
1 + C 2 \/K/do for some positive constants Co, Ci, € 2 - 

Remark 3.3. Kruskal (1976, 1977) provide the classical condition of the identifiability of tensor 
decomposition, that is, it is sufficient for the uniqueness of the decomposition in ( 2 . 2 ) if /ca+^b+^c ^ 
2K + 2, where kx, k^, kc are the Kruskal ranks ^ of the matrices A, B, C. Such a condition requires 
that the rank is of the same order as the dimension of the tensor. Under the overcomplete case 
that K > maxjdi, d 2 , ds}, Chiantini and Ottaviani (2012) prove that the set of tensors not having 
a unique tensor decomposition has Lebesgue measure zero and show that the generic identifiability 
condition holds if A < (di + l)(d 2 + 1)/16. Therefore, Assumption 3.1 is satisfied for most of the 
tensor decomposition problems. 

Remark 3.4. The incoherence condition can be viewed as a relaxation of the orthogonality of 
decomposition components. It is originally introduced by Donoho and Huo (2001) and has been 
widely studied in high-dimensional scenarios, for example, compressed sensing (Candes and Romberg, 
2007) and matrix decomposition (Chandrasekaran et ah, 2012). In the experiments, we will illustrate 
that the incoherence condition of Assumption 3.2 holds if the components aj,bj,Cj are randomly 
generated from the unit and sparse space. 

3.1 Local Statistical Rate 

This section introduces our first main theoretical result of local analysis under the assumptions on 
true tensor, the perturbation error, as well as the initialization. 

We start the local analysis by defining the error term. Recall that £ is the tensor of perturbation 
error, do = niax{doi; do 2 , doo} is the maximal number of nonzero elements in the true decomposition 
components, s = maxjsi, S 2 , S 3 } is the maximal number of nonzero elements in the estimated 
decomposition components from Algorithm 1, and A is the tensor rank. Denote the error 

eR := - ri{£, do + s)-\ -VAC . (3.4) 

^min ^min 

^The Kruskal rank of a matrix is the maximal number k such that every k columns of the matrix are linearly 
independent. 
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The first term in (3.4) represents the sample error caused by the perturbation tensor E and the 
second term is the model error characterized by the incoherent parameter C^. If the eigenvectors 
are orthogonal, the incoherent parameter C = 0 the model error in (3.4) disappears. Under 
Assumption 3.2, the error term has upper bound 2\/5(t(;min)~^[??('£’, do + s) + C'gC'irCmaxV^/do]- 
We also note that when the sparsity do is small. Assumption 3.2 is not sufficient for the estimation 
consistency. However, as long as we have \fKC^ = o(l), the error term is controllable. 

The local statistical rate relies on the following initialization condition. 

Assumption 3.5 (Initialization). Define the initialization error cq := max{D(a(*^), a^), bj)} 

for some j G [K]. We assume that 


eo < 7 := min 


'W^min 

SrCmax 


CiVK Wm\T^ 
do ’ dv/SCsrCmax 



Given a^®^ b^'^^ the sparse vector is calculated based on (2.6). 


We are now ready to introduce our first main theorem on local analysis. 


Theorem 3.6. (Local statistical rate) Consider the model in (2.2) satisfying Assumptions 3.1 

Q /o — 

and 3.2, and assume ||T|| < and K = o{dQ ) with do = max{doi, do 2 , dos}. Let T be 

an input to Algorithm 1 . Assume the perturbation error satisfies r]{£,dQ + s) < u)min /6 and the 
initialization (a*^*^\ b^'^^ c*-^^) satisfies Assumption 3.5 for some j G [K], The solution from the inner 
loop of Algorithm 1 with s* > doi for i = 1,2,3, after N = 0( log(eo/eH)) iterations, satisfies with 
high probability, 

max|L>(a(^\aj),D(b(^),bj),L>(c(^),Cj)| < 0{eR). (3.5) 

Moreover, let tc = T X1 a^^) X 2 b^^) X3 c^'^\ then we have |u} — Wj\ < 0(e_R) with high probability. 

The proof of Theorem 3.6 is shown in Section S.4.1. The key idea is to show that the estimator 
from our TTP procedure has an error contraction effect in each iteration, which leads to the desirable 
contraction result after sufficiently many iterations. 

Theorem 3.6 establishes the local statistical rate. According to (3.4), the error term e/j is of the 
order 0{ri{£,do + s) + y/K/do). Here the recovery error er consists of two parts: the first part is 
the sample error due to the perturbation error of the observed tensor which is unavoidable, and the 
second part is the model error characterized by the incoherence Assumption 3.2 (that allows the 
non-orthogonality of decomposition components). Consider the case that there is no perturbation 
error, i.e., £ = 0 and the error er is in the order of y/Kjdo. In this case, for any fixed K, when do is 
large, the incoherent parameter in Assumption 3.2 becomes small and hence each decomposition 
component is nearly orthogonal, which leads to a simple tensor decomposition problem; on the 
other hand, when do is small, the structure of decomposition components could be more complex, 
and hence the recovery error is very large. Moreover, the requirement on the number of iteration 
N = U( log(eo/eij)) indicates that, if the initialization is appropriate, i.e., eo is small, we only need 
a few steps to achieve a desirable error bound. 
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Remark 3.7. It is worth noting that in the high dimensional regimes, our error rate er significantly 
improves the rate shown in Anandkumar et al. (2014c). Under certain conditions, Anandkumar 
et al. (2014c) prove that their method is able to recover the decomposition with an error rate 
0{ri{£, d) + y/K/d) when di = d 2 = = d. In the high-dimensional regimes where d is large, this 

error is dominated by the sample error r](£,d), which is significantly larger than our sample error 
T](£,do + s). This improvement is further backed up by our experimental studies in Section 5. 

If we replace the truncation step Truncate(-, s) in (2.4) - (2.6) of our Algorithm 1 by a soft- 
thresholding operator S{-, p) = sign(-) max(| • | — />, •) suggested by Allen ( 2012 ), we can achieve the 
following theoretical result similar to the one shown in Theorem 3.6. Therefore, our proof techniques 
are also applicable to establish the theoretical performance of the lasso penalized sparse tensor 
decomposition method in Allen ( 2012 ). 

Corollary 3.8. Suppose the same conditions in Theorem 3.6 are satisfied with s = do- If the 
tuning parameter in the soft-thresholding operator satisfies p> sr and the minimal absolute value 
of the nonzero entries of {aj,bj,Cj} for 1 < i < K is larger than 2p, we have the same rate as (3.5) 
with high probability. 

3.2 Global Statistical Rate 

In order to show the global statistical rate of our TTP method, we need to quantify the error bound 
of the SVD-based initialization in Algorithm 3. In particular, based on the theoretical analysis of 
the initialization, i.e.. Lemma S.4.2 in the supplementary, we establish the following global result. 

Theorem 3.9. (Global statistical rate) Consider model in (2.2) satisfying Assumptions 3.1 and 
3.2, and assume ||T|| < and K = O{do) with do = max{doi, do 2 , doa}. Let T be an input to 

Algorithm 1 . Assume the perturbation error satisfies r]{£, do-|-s) < min |rCmin/ 6 , (rcmin 

for some constant C 5 > 0. Let the number of initializations L = with 7 defined in As¬ 

sumption 3.5 and the number of iterations N = U( log( 7 /e/{)). For any j £ [K], the output of our 
algorithm with Sj > do* for i = 1, 2, 3 satisfies 

max^D{aj,a.j),D{hj,hj),D{cj,Cj)'^ < 0{eR), 

\wj-Wj\ < 0{eR), 


with high probability. 

Theorem 3.9 establishes the global statistical rate of the whole procedure by using the sparse 
SVD as an initialization. Compared to the assumptions in local analysis, Theorem 3.9 requires 
stronger conditions on K and r/(T,do -|- s). It’s worth noting that in general 7 is a constant and 
hence the number of initialization L is a polynomial function of K. In Section 4, we will discuss 
how to choose these parameters in practice. 
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4 Practical Choice of Tuning Parameters 


In Section 3, for simplicity we assume the number of initializations L, the number of iterations N, 
the rank K, and the cardinality parameters si, S 2 , S 3 in Algorithm 1 are known. In this section, we 
discuss how to choose these tuning parameters in practice. 


4.1 The Number of Initializations and The Nnmber of Iterations 


Theorem 3.9 provides a theoretical condition on the number of iterations L = which is a 

polynomial function of K. Based on our extensive experiments in Section 5, we find that in practice 
it is sufficient to choose L = maxjlO, 

Moreover, in practice we do not need to specify the number of iterations N in advance, instead 
we set a termination condition of the truncated power update (2.4)-(2.6) in Algorithm 1 as 


max-^ 


lal*^ — a. 




IcW - 


< 10 


-4 


for each iteration r. Similar strategy has been used in the non-sparse tensor decomposition algorithm 
in Anandkumar et al. (2014c). The convergence of the truncated power method has been shown in 
the sparse matrix decomposition problem by Yuan and Zhang (2013), which can be easily extended 
to our tensor decomposition case. 


4.2 Rank and Cardinality Estimation 


Our TTP method relies on two key components: the rank K and the cardinality parameters si, S 2 , S 3 . 
It has been shown that exact tensor rank calculation is an NP hard problem (Kolda and Bader, 
2009a). In this subsection, following the tuning method in Allen (2012), we provide a BIC-type 
criterion to estimate these parameters in practice. 

Given a pre-specified set of rank values K, and a pre-specihed set of cardinality values 5 i, 52 , 53 , 
we choose the combination of parameters (iiT, si, S 2 ) S 3 ) which minimizes 


BIG := log 


T - YjialK] O bj O C, 


i\\F 


did2ds 


+ 


\og{did2d3) 

did2dz 


(ll^-illo + l|bi||o + llcillo) 

i&[K] 


(4.1) 


The detailed tuning procedure is shown in Algorithm 2. The efficacy of this tuning procedure is 
evaluated in various experimental results in Section 5. 


Algorithm 2 Tuning Procedure for TTP Method 
1 ; Input: tensor T, set of rank values /C, set of cardinality values 5i,52,53. 

2 : For each A G /C, si G 5i, S 2 G ^ 2 , S 3 G ^3 Do 

3: Step 1: Run Algorithm 1 with parameters K, si,S 2 ,S 3 , and L = maxjlO, A^}. 

4: Step 2: Gompute BIG in (4.1) using T and the decomposed components from Step 1. 

5: End For 

6 : Output: (A, s'!, s^ 2 ) 53 ) which minimizes the corresponding BIG. 
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5 Simulation Study 


In this section, we demonstrate the mechanism of the proposed TTP method and illustrate its 
superior performance in sparse tensor recovery. 

In order to measure the recovery quality of the tensor decomposition component as well as its 
weight, we calculate the mean vector estimation error and weight estimation error: 

1 K 11^ II 

1 r||^ ,| 11^ II 11^ ii-i tc — re 

mean error := — > \ + bfc - bfc + - Cfc > ; weight error := -^^-i,—(5.1) 

6K I" nn III! iij ^ 

k=l 'I '' 

To evaluate the variable selection quality, we compute the true positive rate TPR := (TPR^ + 
TPRfe + TPRc)/3 and the false positive rate FPR := (FPRa + FPRf, + FPRc)/3, where 


TPR, := 


7^ 0 ; 7^ 0 ) . 

^ 7^ 0 ) 


FPRa := 


E — 0; [^k]i 7^ 0) 

Eil([afc]i = 0) 


and TPRf,, TPRc, FPRft, FPRc are defined analogously. 


5.1 Illustration of TTP Method 

Denote a sparse and unit space := {x G : ||x|| = 1, ||x||o < do}- In order to generate the 

random decomposition vectors a^, b^, from §'^^“^(^ 20 ); §'^^”^(^ 30 ); respectively for each 

k G [K], we first generate i.i.d. standard Gaussian entries of three components A := [ai, • • • , a^^], 
B := [bi, • • • , b;^], and C := [ci, • • • , c^], then truncate each column of A, B, C with cardinality 
parameters doi > do 2 , do 3 correspondingly, and finally normalize each column and aggregate the 
coefficients as Wk- 

We first empirically check that the tensor based on randomly generated components satisfies the 
incoherence Assumption 3.2. We set the rank K = 10, the dimension di = d 2 = = d, and the 

true sparsity level doi = do 2 = do 3 = do, and fix do = d/10 for each dimension d G [10,1000]. In 
this sparse tensor scenario, we compute the incoherence condition (/ defined in Assumption 3.2 with 
respect to do. Figure 2 shows that the incoherence decays polynomially in do, which demonstrates 
the polynomial bound in Assumption 3.2. 

We next evaluate the efficacy of the tuning procedure in Algorithm 2. We consider the example 
with dimensions di = 40, d 2 = 30, do = 20, true cardinality parameters doi = 8, do 2 = 6, dos = 4, 
and the rank K = 3. In Algorithm 2, we let the pre-specihed set of rank values be /C = {1,..., 8} 
and the pre-specihed set of cardinality values be Sj = dj x for j = 1,2,3. Among 

all the combinations, our tuning algorithm outputs K = 3 and Sj = dj x 10®'^. In the left plot of 
Figure 3, we show the plot of BIG as a function of K given the selected cardinality parameter, and 
in the middle and right plots of Figure 3, we compute the estimation error in recovering tensor 
decomposition component and evaluate the TPR/FPR with respect to Ratio = given 

the selected true K. Clearly, the selected ratio from our tuning algorithm leads to the minimal 
estimation error and the best variable selection performance. 
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Figure 2: Incoherence C of randomly generated sparse components versus sparse cardinality do- 



Figure 3: Illustration of the tuning procedure in Algorithm 2. The left plot shows BIC as a function 
of K, the middle plot and the right plot show the estimation error and variable selection performance 
with respect to Ratio= given the selected true K. 

5.2 Comparison with Competitive Methods 

In this subsection, we compare our TTP method with two competitors: the non-sparse tensor 
decomposition method in Anandkumar et al. (2014c) and the lasso penalized sparse tensor decom¬ 
position method in Allen (2012). In our TTP method, the random initialization in Section S.1.1 is 
employed and the tuning parameters are selected via Algorithm 2 with Sj = dj x 
(j = 1, 2,3). For a fair comparison, the tuning parameters Ai, A 2 , A 3 in the lasso penalized method 
are also tuned via BIC with the same pre-specified set of parameters In all three 

methods, we use the same true rank K such that the comparison is not mixed with the choice of K. 

In all simulations, we fix the true cardinality doj = 0.2dj (j = 1,2,3) and vary values of the 
dimension and the rank. In particular, we consider four scenarios: I: di = 1000, d 2 = 10, d^ = 
10, K = 1; II: di = 1000,^2 = 10,^3 = 10,A: = 2; III: di = 1000,^2 = 100,^3 = 10,A: = 1; IV: 
di = 1000, d 2 = 100, ds = 10,K = 2. Note that in scenarios III and IV, three dimensions di,d 2 , ds 
are different and three cardinality parameters doi,do 2 ,do 3 are also different. 

We calculate the averaged mean estimation errors, averaged weight estimation errors, and 
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averaged variable selection criterions TRP/FPR over 30 random replications. As shown in Table 
1, in all scenarios our TTP method achieves significantly better performance than the non-sparse 
tensor decomposition method in Anandkumar et al. (2014c) in both estimation accuracy and variable 
selection performance. In particular, our improvement in sparse mean vector estimation is more 
than 50% over all scenarios, which indicates the importance of variable selection in sparse tensor 
decomposition. In addition, compared to the lasso penalized method in Allen (2012), our method 
obtains better performance in recovering the tensor decomposition components in all four scenarios, 
i.e., smaller mean errors, with slightly worse performance in estimating the weight in two out of 
four scenarios. In terms of the variable selection performance, both our method and the lasso 
penalized method are able to correctly identify almost all true important variables, while ours 
tends to include more noisy variables than the lasso method. Next, we compare the computational 
costs of three methods using scenario III for illustration, in which case the tensor has 1 million 
((11^2^3 = 10®) entries. It takes our Algorithm 1 about 30 seconds to run one replication on a single 
laptop with 1.3 GHz processor and 4GB memory, which is comparable to the non-sparse tensor 
decomposition method in Anandkumar et al. (2014c) (30 seconds), and is slower than the lasso 
penalized method in Allen (2012) (10 seconds). This is partially due to the fact that our TTP 
method runs L initializations while the lasso penalized method only runs K initialization. Finally, 
based on the performance of our method across all scenarios, we observe that when dimension is 
fixed the recovery problem gets harder as K increases. This agrees with our theoretical finding that 
tR increases as K increases. 

Table 1: The averaged mean errors, the averaged weight errors, the averaged TPR/FPR variable 
selection performance in Anandkumar et al. (2014c) (Non-sparse), Allen (2012) (Lasso) and our 
method (Ours) in all examples of Section 5.2. The standard error is shown in subscript. The 
minimal error in each scenario is shown in bold. 


Scenarios 

Methods 

mean error 

weight error 

TPR 

FPR 

I 

Non-sparse 

0.295o.o218 

0.053o.oo84 

lo 

lo 


Lasso 

0.258o.0294 

0.016o.0058 

0.993o.oo43 

0.009o.0034 


Ours 

0.171o.0253 

0.021o.o053 

0.992o.oo61 

0.017o.oii7 

II 

Non-sparse 

0.300o.0195 

0.067o.oi28 

lo 

lo 


Lasso 

0.204o.oi48 

O.OO80.0013 

0.998o.ooo8 

O.OI60.0035 


Ours 

0.185o.0224 

0.022o.oo56 

0.992o.oo61 

O.O860.0215 

III 

Non-sparse 

0.086o.oi44 

0.015o.oioi 

lo 

lo 


Lasso 

0.055o.oo29 

0.002o.ooo4 

lo 

0.003o.0022 


Ours 

0.036o.oo42 

0.002o.ooo4 

lo 

O.OI60.0130 

IV 

Non-sparse 

O.I96o.o416 

0.071o.0260 

lo 

lo 


Lasso 

0.052o.ooi8 

0.002o.ooo3 

lo 

0 .002 o.ooi6 


Ours 

0.041o.o064 

0.002o.ooo3 

lo 

0.067o.o31i 
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6 Real Data Analysis 


This section investigates the efficacy of our TTP method on Click-Through Rate (CTR) prediction 
for online advertising. Additional real applications to high dimensional gene clustering will be 
discussed in Section S.3.2 in the online supplementary materials. 

The used advertising dataset comes from a major internet company and forms a third-order 
tensor {user-age-group, device-type, advertisement). Each tensor entry stores the real-valued CTR 
(ratio of users who click on an ad to the number of total users who see this ad) for the corresponding 
combination of user-age-group, device-type, and advertisement. The user-age-group consists of 5 
categories and the device-type consists of 2 types: PC and mobile. We extracted the most active 
100 advertisements that are widely reachable to each user-age-group and each device type. The 
constructed tensor is of size 5 x 2 x 100. We use the ads click tensor data 7i on Nov. 1, 2015 for 
training and use the ads click tensor data Ti on Nov. 2, 2015 for testing. Since the click event is 
generally very rare, the constructed tensor is very sparse (about 90% zeros), which indicates that a 
sparse tensor decomposition method could be suitable for the low rank tensor approximation. 

We compare our TTP method with two popular regression methods, linear regression and 
gradient boosting machine (CBM) (Friedman, 2001). In these two regression methods, the real¬ 
valued CTR is treated as the response and {user-age-group, device-type, advertisement) are treated 
as covariates. In our Algorithm 1, given the input tensor 71, we fix AT = 1 and the cardinality of 
{user-age-group, device-type, advertisement) as (5,2,10) to compute the decomposed components 
b,c. We evaluate both training error ||71 — wao bo cHj? and testing error ||72 — wao bocHi?. 
As shown in Table 2, our TTP method greatly improves both regression methods, which indicates 
a promising potential of our method in exploiting the tensor structure for CTR prediction. The 
advantage of tensor decomposition method over traditional classification algorithms has also been 
observed by Rai et al. (2014) and Zhe et al. (2015) for exploiting the binary tensor data. 

Table 2: CTR prediction errors via linear regression and generalized boosted regression model 
(CBM) on flattened tensor data and our sparse tensor decomposition on original tensor data. 


Methods 

Training error 

Testing error 

Linear regression 

0.189 

0.534 

CBM 

0.190 

0.533 

Ours 

0.141 

0.511 


7 Discussion 

In this paper, we propose a new sparse tensor decomposition method via a truncation step and 
establish both local and global statistical rates of the TTP estimate. Our method is applied to solve 
various high-dimensional problems, including high dimensional clustering and mixtures of linear 
sparse regressions. It is worth noting that our TTP method can also be adapted to solve other high 
dimensional latent variable models, e.g., mixed membership community detection (Anandkumar 
et ah, 2014a) and dictionary learning (Barak et ah, 2014; Peng et ah, 2014). 
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Supplementary Materials 

Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng 

This supplementary contains four parts: (1) Section S.l explains affiliated steps in our main 
algorithm; (2) Section S.2 discusses applications of our sparse tensor decomposition to high¬ 
dimensional latent variable models; (3) Section S.3 mentions additional simulation study and real 
data analysis on high-dimensional clustering; (4) Sections S.4 - S.6 are devoted to detailed technical 
proofs for the theoretical developments. 

5.1 Initialization and Clustering Procedures 

This section summarizes the initialization and the clustering procedure in our TTP algorithm. 
We propose two initialization procedures: one is a sparse SVD initialization and another one is 
a random initialization. The former has a nice theoretical guarantee, while the latter is simple 
and fast in practice. We further provide an efficient clustering procedure in order to generate all 
decomposition components. 

5.1.1 Initialization Step 

The first initialization is performed via the sparse SVD initialization, see Algorithm 3. Given a 
randomly generated Gaussian vector 6 , the algorithm first computes its truncated vector 6 , then 
computes the truncated vectors of the leading left and right singular vectors of the matrix T X 3 0 , 
and finally outputs the desirable initializations and . Given a^^^ and , the vector is 
computed based on the main update step in (2.6). The input cardinality parameter (si, 52 , 53 ) in 
Algorithm 3 is same as the one supplied in Algorithm 1. 

Algorithm 3 Initialization via sparse SVD 
1 : Input: tensor T, cardinality parameter ( 51 , 52 , 53 ). 

2 : Step 1: Generate a d-dimensional standard Gaussian vector 6 ~ ^(0,1^). 

3: Step 2 : Compute the sparse vector 6 = Truncate(0, maxjsi, 52 , 53 }). 

4: Step 3: Calculate ui and vi as the leading left and right singular vectors of T X3 0. 

5: Step 4: Compute the sparse vectors ui = Truncate(ui, 5i) and vi = Truncate(vi, 52 ). 

6 : Output: = Norm(ui), b^^^ = Norm(vi), and via (2.6) with input br'^^ 

We provide an intuitive explanation of this initialization procedure. Remind that T X 3 0 = 
T X 3 6 + £ X 30 is a multilinear combination of the tensor slices. Based on (2.2), we have 
T x^e = Y.i&[K] Wi{cj6)aihj G Intuitively, we can treat Wi{cJ6) as the singular value, 

and aj,bj as the left and right singular vectors. Although this is not an exact singular value 
decomposition since the spaces of [ai,..., a^] and [bi,..., bx] are not orthogonal, we show in 
Lemma S.4.2 that this algorithm eventually generates good initializations if we repeat this procedure 
many times. Most importantly, this initialization is shown to lead to the global statistical rate of 
the final tensor decompositions, see Theorem 3.9. 
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In practice, in order to save computational cost, we introduce an efficient random initialization 
procedure. Specifically, we can initialize and as follows. We generate two standard Gaussian 
vectors, then truncate them by only keeping a few largest absolute values and setting other entries 
as zeros, and finally normalize them to be of unit norm. The vector is again computed based 
(2.6), see Algorithm 4. This random initialization is shown to be very efficient in practice, although 
its theoretical analysis is still unclear. 

Algorithm 4 Random Initialization 
1 : Input: tensor T, cardinality parameter (si,S 2 ,'S 3 ). 

2 : Step 1: Generate ui from A(0,IrfJ and vi from ^( 0 , 1 ^ 2 )- 

3: Step 2: Compute the sparse vectors ui = Truncate(ui, si) and vi = Truncate(vi, S 2 ). 

4: Output: = Norm(ui), = Norm(vi), and via (2.6) with input b?^\ 


S.1.2 Clustering Step 

Algorithm 5 introduces the clustering step in identifying all the K decomposition components from 
the L generated estimators, see line 7 of Algorithm 1. It outputs the K decompositions sequentially. 
Within each loop, the algorithm finds the tuple (a, b,c) such that \T XiaX 2 bx 3 c| is maximized. 
The intuition of this step is that if|T XiaX 2 bx 3 c| is large for some (a, b,c), then these vectors 
are close to some true decomposition (a^, bj, cj), j G [K], Next, the algorithm removes all tuples 
that are close to (a, b, c) since these tuples eventually generate the same decomposition vector up 
to an error tolerance. This procedure is repeated until all K decompositions are generated. 

Algorithm 5 Clustering procedure 
1 ; Input: tensor T, set S = {(a,-, b,-, c,- ),t€[L|}. 

2 : For j = 1 to K Do 

3: Step 1: Find (a,b,c) = argmaX(-ab,c)eS Xi a X 2 b X 3 c|. 

4: Step 2: Perform N iterations of (2.4)-(2.6) with initialization (a, b,c) and denote the final 

update as (a^, bj, Cj). 

5: Step 3: Remove all tuples in S with min{||aT- ± a||, Hb,- ± b||, ||ct- ± c||} < 0.5. 

6 : End For 

7: Output: G [A]}. 


In Step 3 of Algorithm 5, the ± sign takes care of the possibility that (a, b, c) may have reverse 
sign of the true component. The choice of thresholding value 0.5 is not critical. In fact, in our 
experiments, if the distance between (a,-, b,-, c,-) and (a, b,c) is not small, smaller than 10 “'^, then 
their distance will generally be very large, greater than 1. Therefore, setting the thresholding as 
any constant higher than 10 “^ and smaller than 1 generates the same result. 

We next demonstrate the mechanism of the clustering procedure via a simulated example. We 
consider the same simulation setup as in Section 5.1 and fix di = ^2 = ds = 100, doi = do 2 = do 3 = 50 
and K = 5. For each k € [5], we compute the distance of min{||aT- ± a||, ||bT- ± b||, ||ct- ± c||} in 
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Step 3 of Algorithm 5. Figure S4 shows the results for A; = 1, 3, 5, where the thresholding value 0.5 
is shown in red in each plot. The cases of fc = 2 and 4 are similar and hence omitted. As shown 
in the plots, those initializations close to (a, b,c) have distance less than cq = 10 “^, which is the 
stopping criterion value used in our algorithm; while those initializations far away from (a, b,c) 
have distance around \f2. This ensures that the choice of thresholding value 0.5 is not sensitive, 
which empirically supports our above discussions. 


k=1 k=3 k=5 



0 20 40 60 80 100 0 10 20 30 40 2 4 6 8 10 

Initialization Initialization Initialization 

Figure S4: Illustration of the clustering procedure. Each circle represents the distance of each 
initialization vector to the cluster center. The line in read refers to the thresholding value 0.5 used 
in our algorithm. 


S.2 Applications of Sparse Tensor Decomposition 

This section discusses the applications of our sparse tensor decomposition to high-dimensional 
latent variable models. Tensor spectral method has been applied to various statistical models with 
latent variables including Gaussian mixture model (Dasgupta, 1999; Sanjeev and Kannan, 2001; 
Dasgupta and Schulman, 2007; Vempala and Wang, 2002; Belkin and Sinha, 2010; Kalai et ah, 
2010; Moitra and Valiant, 2010; Hsu and Kakade, 2013) and mixture linear model (Viele and Tong, 
2002; Chaganty and Liang, 2013; Yi et ah, 2014). We consider these two models under the high 
dimensional setting where the dimension is large and the parameters have sparsity structures. 

S.2.1 Sparse Gaussian Mixture Model 

The sparse Gaussian mixture model is a generalization of ordinary Gaussian mixture model by 
assuming that the mixed mean vectors are sparse. In particular, suppose each d-dimensional 
observation = 1, • • • ,n, is drawn from a Gaussian mixture model with probability density 
function (pdf) f{x) = l^ki ^k), where > 0 is the mixture weight and fk{x; ^fe) 

is the pdf of a multivariate normal distribution, 

fk{x;nk,'^k) = (27r)"‘^/^|Sfcr^'^^exp |-^(a; - - /Xfc)| . 

We assume ||/xi||o = • • • = ||/xa'||o = do. Notice that the supports of these vectors are not necessarily 
the same. In addition, to facilitate the high-dimensional clustering as in Pan and Shen (2007), we 
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further assume that a common diagonal covariance matrix is shared among the mixture components. 
That is, the observations in the A:-th cluster follow a multivariate Gaussian cr^I). 

Our goal is to recover Wk, for k = 1, - ■ ■ ,K via our sparse tensor decomposition framework 

based on the observations iCj, f = 1, • • • , n. The following Lemma relates the model parameters to 
the second and third moments such that our sparse tensor decomposition can be employed. 

Lemma S.2.1. (Hsu and Kakade, 2013, Theorem 1) Assume the center matrix [/xi,...,] 
has column rank K. The variance ci^ equals the smallest eigenvalue of the covariance matrix 
E[XoX]-E[X]oE[X]. Define M := E[XoXoX]-a^ X;li(IE[X]oeioei+eioE[X]oei+eioeioE[X]), 
where {ei, ..., is the coordinate basis for Then 

K 

M = '^ WkfJ-k o fJ'kO fJ-k- 
k=l 

According to Lemma S.2.1, the model parameters cr^ can be recovered based on the empirical 
covariance matrix, and Wk,tJ-k can be recovered by applying our sparse tensor decomposition 
procedure to Ai estimated from empirical moments. In this scenario, we treat the decomposition 
components equally, that is, = fj,k for k G [K], 

Remark S.2.2. Note that our sparse tensor decomposition procedure assumes unit decomposition 
vectors, i.e., ||/Xfc|| = 1. When the true means /Xfc are not unit, we provide a scaling procedure by 
first recovering Wk and then recover f^k via a similar strategy as in Hsu and Kakade (2013) and 
Anandkumar et al. (2014b). Denote M 2 := E[X o X] — Let U G be the orthonormal 

eigenvectors of M 2 , and D G be the diagonal matrix of positive eigenvalues of M 2 . Let 

W := UD-V2 and M := A4(W, W, W). We have 

K 

M = '^ ojiko flk, 

k=l 

where Jlk '■= ^k satishes \\Jlk\\ = 1,A: G [K]. Note that Jik G M^. Therefore, in Step 1 we 

can apply the robust tensor decomposition method (Anandkumar et ah, 2014b) to the non-sparse 
tensor Ai to obtain the weight Wk,k G [K], Then the mean vectors fXk can be recovered by applying 
our sparse tensor decomposition to Ai and rescaling according to weight Wk,k G [K] from Step 
1. In practice, when M 2 estimated from data is singular, we first apply a thresholding procedure 
(Bickel and Levina, 2008) on M 2 such that its eigen-decomposition is well defined. When the mean 
vectors /Xfc are jointly sparse, this thresholding procedure can efficiently shrink the corresponding 
non-important coordinates in M 2 to be zero. 

Remark S.2.3. We point out that the success of our sparse tensor decomposition relies on the 
incoherence condition in Assumption 3.2. When the mean vectors fXk does not satisfy this incoherence 
condition, we show a heuristic procedure to deal with such a situation. The key idea is to shift the 
original data such that the new cluster means are nearly orthogonal. For example, assume K = 2 
and the original true mean vectors are fii = (I, l)"*^ and fi 2 = (—1, —l)"*^- The incoherence condition 
fails completely since /xi and /X 2 are perfectly correlated. By moving the original data toward the 
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direction (—1, we obtain two new mean vectors JIi = (0,2)"'^ and ^2 = (—2,0)"'^, which are 
orthogonal such that our sparse tensor decomposition method becomes suitable. In practice, the 
true mean vectors are unknown. We suggest using the standard k-means algorithm to obtain the 
initial mean vectors and then shift the data such that new cluster centers are nearly orthogonal. In 
Section S.3.2, we demonstrate that this procedure works very well in practice. 

S.2.2 Mixtures of Linear Sparse Regressions 

The mixture of sparse linear model is a generalization of the sparse Gaussian mixture model. It 
has been employed in music perception, where covariate is the actual tone and response is the tone 
perceived by a musician (Viele and Tong, 2002). Applying a similar notation system as the Gaussian 
mixture model, we denote the number of mixture components as K. Given the covariates X G 
the mixture of linear sparse regressions is generated as follows. 

(i) Draw component label h ~ Multinomial(7r), for vr = (vri,..., ttk), 

(ii) Draw observation noise e from a known zero-mean distribution, 

{in) Draw response Y = PjX + e, 

where the coefficients /3i,..., Pk £ 1^'^ are high-dimensional sparse vectors satisfying the incoherence 
condition in Assumption 3.2 with ah = hh = Ch = Ph for h = 1,... ,K. 

In practice, given i.i.d. samples D = {(®i, yi),..., (xn, Un)} drawn from the mixture of linear 
regressions, we aim to learn the parameters 7r,/3 i,... ,Pk- Ghaganty and Liang (2013) proposed 
spectral experts for estimating these parameters via a tensor power method. However, in their 
approach, the parameters Pi, ■ ■ ■ ,Pk are not sparse. We combine the spectral experts algorithm 
in Ghaganty and Liang (2013) with the sparse tensor decomposition in this paper and derive a 
new approach to learn vr and the sparse coefficients Pi,., Pk- Denote (•, •) as a generalized dot 
product between two p-th order tensors such that {X,y) := Xix,...,ipyh,...,ip- 

Lemma S.2.4. (Ghaganty and Liang, 2013) Define the generated vector and tensor as, respectively, 
m := Y.h=i '^hPh and M := '^hPh ° Ph° Ph, then we have 

y = (m,X)+yi(X), (S.l) 

y3 = (M,XoXoX)+ 3E[€2](m,X)+E[e3]+r? 3 (X), (S.2) 

where the noise terms 'qpX) and r]s{X) have zero means for any covariate X G 

According to Lemma S.2.4, we can perform two low-rank regressions to recover m and Ai, 
and then apply our sparse tensor decomposition procedure on Ai to obtain the estimation for the 
desirable parameters. Specifically, according to (S'.!), we first estimate the sparse vector m via lasso 
regression (Tibshirani, 1996) with a tuning parameter Ai > 0, 

m:=argmin^ ((m, a;*) - -k Ai||m||i, 

{Xi,yi)ev 
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and then estimate the tensor M via a low-rank regression (Chaganty and Liang, 2013) 

Al:=argimn^ ^ tc, o cc, o aij) 3E[e^](m, -h E[e^] - +A3IIXII*, 

(xi,yi)&T> 

where A3 is a tuning parameter and ||Ad||* = d~^ Sf=i II[-A^]:,:lll* is the average nuclear norm over 
all d unfoldings. Finally, we perform our sparse tensor decomposition procedure on the estimator 
M. to recover the parameters vr, /3i,..., (3k- 


S.3 Additional Experiments 

This section examines the effectiveness of the proposed sparse tensor decomposition method in the 
high-dimensional clustering problem with both simulated examples and real data analysis. 


S.3.1 Simulation Study 

To assess the clustering performance, we define the cluster error (Sun et ah, 2012) as the estimated 
distance between an estimated clustering assignment ijj and the true assignment of the sample 
data £Ci,..., Xn- 


cluster error = 



= ip{xj)) ^ liirixi) = 'tl;{xj))-i< j} 


where |A| is the cardinality of set A. In this case, the mean error defined in (5.1) reduces to 
mean error = K~^ The weight error, TPR, and FPR can be computed analogously. 

The simulated data consist of n = 1000 observations Xi G i G [n] of dimension d = 10. 
First, cluster memberships y^’s are uniformly sampled from {1,2,3,4}. Then for each cluster y^, we 
generate 250 samples from Ai(/r(yj), cr^I), where 


/r(yi) = m l(yi = 1) + fi2 l(yi = 2) -h fJ-sHyi = 3) -h 7 x 4 l(yi = 4), 

and (T^ = 0.1. To examine the performance in various scenarios, the following four sparse models 
of mean vector /r(yi) are considered. The final mean vectors are normalized to have unit norm. 
Denote e* G as the basis vector whose z-th coordinate is 1 and the rests are zeros. 


• Model 1: /x^’s are orthonormal and each has do = 1 nonzero elements: /Xfc = for k G [4]. 

• Model 2: /x^’s are orthonormal and each has do = 2 nonzero elements: /xi = ei -|- 02, 7x2 = 
©3 + 64,7x3 = 05 -|- ee, 7x4 = 07 -|- eg. 

• Model 3: Tx^’s are not orthogonal with small incoherence parameter ^ = 0 . 2 : 7x4 = ei -|- 
0.2e2,7^2 = G2 + 0.2e3,7x3 = 03 -|- 0.2e4,7x4 = 04 -|- 0.2ei. 

• Model 4 : tx^’s are not orthogonal with large incoherence parameter ^ = 0 . 5 : 7x4 = ei -|- 
0.9e2, 7 x 2 = e2 -I- 0.9e3,7x3 = 03 -|- 0.9e4,7x4 = 04 -|- 0.9e4. 
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We evaluate the performance our procedure in reconstruction of the four clusters by taking 
Model 3 for illustration. Figure S5 shows the original four-cluster samples and the reconstructed 
samples generated from the distribution based on uifc, /c = 1,..., 4. Clearly, the reconstructed 
samples mimic the original observations very well. 


0 1 2 



-1 0 1 2 



Figure S5: Original samples and reconstructed sample in Model 3 of Section S.3.1. 

Finally, we report the cluster errors, the mean estimation errors , the relative estimation errors of 
weight, as well as the TPR/FPR variable selection performance. We compare two methods: one using 
empirical tensor M estimated from samples; another one using true tensor A4 based on population 
parameters. Table SI summarizes the results. First, our sparse tensor decomposition method 
successfully identihes the true important variables in all models, which shows the superior variable 
selection performance. Second, when the true mean vectors are orthonormal, i.e., incoherence 
parameter (' = 0 in Models 1-2, using true tensor will fully recover the mean vectors and the weight, 
while when incoherence parameter C 7 ^ 0 in Models 3-4, even decomposing the true tensor can not 
fully recover them. This observation agrees with our theoretical hndings in Theorems 3.6 and 3.9 
that the error e/j = 0(r/(T, do) + \/R/do) consists of two parts, one is due to sample error 0{r]{S, do)) 
and another one is due to incoherence condition reflected as 0{y/K/do)- Third, comparing the 
mean errors based on A4 and A4, we observe that the sample error overwhelmingly dominates the 
model error. In this case, our error rate €r significantly improves the rate shown in Anandkumar 
et al. (2014c), see the discussions after Theorem 3.6. 

S.3.2 Real Data Analysis 

We apply the proposed sparse tensor decomposition method to high-dimensional gene clustering 
for the Leukemia microarray dataset (Golub et ah, 1999). The goal is to distinguish two types of 
human acute leukemias: acute myeloid leukemia(AML) and acute lymphoblastic leukemia(ALL) 
based on the gene expression data. 

This dataset consists of 72 patients in total, 25 patients with AML and 47 patients with ALL. 
The Gene expression levels were measured by Affymetrix microarrays containing 3571 human genes 
after preprocessing (Dettling, 2004). Distinguishing ALL from AML is clinically significant for 
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Table SI: The cluster errors, the mean estimation errors, the weight estimation errors, as well as 
the TPR/FPR variable selection performance of our method in Models 1-4 of Section S.3.1. 


Tensor Type 

Model 

cluster error 

mean error 

weight error 

TPR 

FPR 


Model 1 (C=0) 

0.0361 

0 

0.0835 

1 

0 

Estimated A4 

Model 2 (C=0) 

0.0408 

0.0318 

0.0647 

1 

0 


Model 3 (C=0.2) 

0.0465 

1.5773 

0.0924 

1 

0 


Model 4 (C=0.5) 

0.1189 

1.0224 

0.2968 

1 

0 


Model 1 (C=0) 

0.0361 

0 

0 

1 

0 

True M 

Model 2 (C=0) 

0.0408 

0 

0 

1 

0 


Model 3 (C=0.2) 

0.0475 

0.0797 

0.0170 

1 

0 


Model 4 (C=0.5) 

0.1213 

0.0490 

0.2470 

1 

0 


successful treatment because those chemotherapy regimens for ALL patients are different from AML 
patients, in which case using ALL therapy for AML (and vice versa) cases may result in distinctly 
reduced cute rates and possible toxicities. 

Table S2: The selected numbers of informative genes and cluster errors in Leukemia dataset. 


Methods 

No. genes 

cluster error 

K-means 

3571 

2/72 

Reg. k-means 

211 

2/72 

Ours 

60 

2/72 


We compare our method (Ours) with s = 30 with the standard k-means algorithm (K-means) and 
the regularized k-means clustering method (Reg. k-means) in (Sun et ah, 2012). As an evaluation 
criterion, we compare the estimated clustering assignments to the available cancer types of each 
tumor. The comparison results are summarized in Table S2, where results of k-means and regularized 
k-means are from Sun et al. (2012). Clearly, our method achieves competitive clustering performance 
with much less selected important genes compared with the k-means and the regularized k-means 
clustering algorithms. 

S.4 Proof of Main Results 

We provide the proofs of the main results. First we prove the results in Theorem 3.6 for local 
analysis. Then we establish the results for global analysis in Theorems 3.9. For simplicity, in the 
following proofs we consider the case di = d 2 = = d and doi = <^02 = do 3 = do, and hence in our 

Algorithm 1 we let si = S 2 = S 3 = s. The extension of the proofs to a general case with different 
parameters in three modes is trivial but involves more complicated notations. 













S.4.1 Proof of Theorem 3.6 


Our proof consists of two steps. In Step 1, we show a general contraction result in one iteration 

to quantify the error of D(c,Cj) when the input estimators a and b satisfying D{aL,aj) < e and 

Dih,h,) < e. In Step 2, we carefully calculate the explicit contraction result by applying the 
assumptions on the perturbation error and initialization. In particular, we show that D(c,Cj) < 
cr + qeo, where €r is a non-contracting term and qeo with g < 1 is a contracting term. Then the 
desirable error bound is obtained by applying this explicit contraction result repeatedly. 

Step 1: The next lemma accomplished the first step. Dehne a function do) as 

/(e;K,do):=^(l + C 2 yj) e + (S.l) 

Q /O 

for some constants Cq, Ci, C 2 , > 0. When K = o(dg' ), the first two terms of /(e; K, do) converge 

to 0 and the last term is the contracting term. 

Lemma S.4.1. (General contraction result in one iteration) Consider model in (2.2) satisfying 

Q /o ^ 

Assumption 3.2, and assume ||T|| < CstCmax and K = o{dQ ). In addition, assume estimators a 
in (2.4) and b in (2.5) of our algorithm satisfy D{a,aj) < e and Zl(b,bj) < e for some j G [K], 
If the perturbation error (C(T,do + s) with s > do is small enough such that Ci^^do -|- s) < 
Wj{l — e^) — tC ma x/(e; K, do), then the update c in (2.6) satisfies, with high probability, 

^ V5wn,i,^f{e;K,do) + V5C{£,do +s) 

~ Wj{l - e"^) - Wnia.xf{e;K,do) - C{£,do +s)' 

If we further assume D{c,Cj) < e, then the update w = T XiaX 2 bx 3 C satisfies, with high 
probability, \w - Wj\ < 2wje^ + rcmax/(e; K, do) + ({£, do + s). 

The detailed proof of Lemma S.4.1 is discussed in Section S.5.1 in the Appendix. 

Lemma S.4.1 provides the error bound of one update in a general form. Clearly, when the input 
error e increases, the function f(e;K,do) increases and hence the output error bound of D{c,Cj) 
will be larger. Furthermore, if the perturbation error ^(T, do -|- s) increases, the problem is getting 
harder and the output error bound will also increase. Moreover, the output error bound in {S.2) 
improves over the input error e since the only contracting term in /(e; K, do) is in the order of 
when K = o(dg'^^). 

Step 2: In this step, by carefully employing the conditions on the perturbation and initialization, 
we provide the explicit contract result by simplifying the error bound in (S'.2). We show that its 
denominator is lower bounded by Wyam/‘^ and hence it is upper bounded by the sum of a contracting 
term and a constant non-contracting term. Then the desirable error bound follows after N iterations. 

Denote q ■= ^ (^+*^2TC'seo and q := We have /(eo; K, do) = CiVK/do+qeo- 

According to the initialization condition in Assumption 3.5, we have q < rcmin/ (dv/Sifmax) and hence 
q < 1/2 < 1 and/(eo; A', do) < Wmm/{Qwiaa.x)- This together with the condition/(T, do-|-s) < tCmin/6 
and the initialization condition eo < Wmin/ (SrCmax) implies that the denominator in (5.2) has the 
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desirable lower bounded, that is, 


> 

> 


Wj{l - el) - Wma.^f{€o]K, do) - r]{S,do + s) 


Wr, 


1 - 


w 


max 2 


Wn 


W^min 

1 1 
6 ~ 6 


^0 “ 


w 


--f{e;K,do)- 


r]{£,do + s) 


‘W^min — 


min 
^min 


Wr^ 


(S.3) 


This further validates that the assumption r/(T, do + s) < Wj{l — el) — W ma x/(en; K, do) in Lemma 
S.4.1 is fulfilled. 

Finally, we bound the whole error term of D(c, Cj) by showing that it can be written as a sum 
of a contracting term and a constant non-contracting term. Specifically, according to (3.2) and 
(5.3), in each iteration, we have 


D{c,Cj) < 

< 


V5^^max/(e; K, dp) + V5r]{£, dp + s) 

Wj{l - e2) - rcmax/(e; K, dp) - r]{£, dp + s) 

2v/5C2rCmax ,2^/5 _ , , , 

H-do + s) + geo = 


Wr, 


dp 


Wr, 


+ qeo, 


where e^ is a non-contracting constant term and qep is a contracting term. By iteratively applying 
above inequality, after N = n(log (f^)) iterations, we have, 


max^D{a^^\aj),D(b^^\hj),D{c^^\cj)'^ < 0{eR). 


The bound of weight \w — Wj\ < O^ept) follows directly. This ends the proof of Theorem 3.6. 


S.4.2 Proof of Theorem 3.9 


In order to show the global statistical rate of our procedure, we need to first quantify the error 
bound of the sparse SVD-based initialization in Algorithm 3, and then quantify the accuracy of the 
clustering process in Algorithm 5. Then the desirable global statistical rate follows by incorporating 
the local statistical rate result in Theorem 3.6. 

The following Lemma establish the error bound of the sparse SVD initialization. The idea is to 
show that the generated initialization is close to one of the true decomposition components. Define 


g{L) := V21n(L) 


ln(ln(L)) + Co 
2v'21n(L) 


V‘^HK), 


for some positive constant Cp, and denote 


fJ-R = 



/imin := mm 



,^J-R 


for some positive constants Ci, € 2 - Let g := {2fPR + g — 1)/{1 — n) for some 0 < ^ < 1. 
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Lemma S.4.2. (Sparse SVD initialization) Consider model in (2.2) satisfying Assumption 3.2, and 
assume K = O{do), suppose we run L initialization procedures in Algorithm 3 with L satisfying 


aiL) > 


4wmax(l + ^)VlogK 


Wmin -Cw max (1 + ^) ’ 

with fj. < u>min/(C^max) ” 1 , then at least one of the paris (ai'^^b®), r G [L], say j*, satisfies 


max • 


„ N n/C O U / d-Wmax/^min 1 + OV^ogK + aoy/sr]{£, do + s) 

.\D{ay,ai),D{hy,hi)\ < -^ 77 ^^- r ^ 1 ^ 

I ^ ) WrainaaiL) - aoy/sr][£,do +S) 

with high probability. 


The proof of Lemma S.4.2 is discussed in Section S.5.3 in the Appendix. Based on Lemma S.4.2, 
next we carefully quantify the condition on L such that the error in (S'.d) satisfies the required 
initialization condition. 

For sufficiently large do and when K = 0{do), it is easy to see that 7 defined in Assumption 
3.5 is lower bounded by 7 > min {r(;inin/12rcmax, u^min/( 8 \/ 5 C' 3 ri;max)} • Setting the upper bound in 
(SA) as min {ri;niin/12rcmax) w^min/( 8 \/ 5 C' 3 r(;max)} implies that, it is sufficient to have 

a{L) = £l(^ \/\ogK ^ - ^ —\/i^(<?, do + s) 

\T IWmin 

According to the condition on perturbation error r]{£,do + s) < (tC min /C5)Vs-^logK, we have 
y/sri{£,do + s)/( 7 rcmin) < \/^ogK /When 7 is small, this term is dominated by 7 “^\/log K. 
Therefore, it is sufficient to require the number of initialization L satisfies 

g{L) = 0 ( 7 - 271 ^), i.e. , L = iL^(V 7 L. 



Next, the justification of the clustering procedure can be adapt from Lemma 17 in Anandkumar 
et al. (2014c). That is, the clustering process in Algorithm 5 outputs K cluster centers that are 
O(eij) close to the true components of the tensor. Finally, the desirable global statistical rate follows 
by incorporating Lemma S.4.2 and the local statistical rate result in Theorem 3.6. This ends the 
proof of Theorem 3.9. ■ 


S.5 Proofs of Lemmas S.4.1-S.4.2 and Corollary 3.8 


In this section, we present the detailed proofs of Lemma S.4.1 of the general contraction result in 
one iteration and Lemma S.4.2 of the sparse SVD initialization. In addition, we provide a theoretical 
justification of the lasso penalized sparse tensor decomposition in Corollary 3.8. 

Before that we introduce an important definition to restrict the operation of a tensor on its 
partial entries. For an index set F = Fi o F 2 o with F) C [d], we denote Ff the restriction of the 
tensor T on the three modes indexed by Fi, F 2 and F 3 , respectively. That is. 




[d~]i,j,k 

0 , 


if z G Fi,j G F 2 ,and k G F 3 . 
otherwise. 
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S.5.1 Proof of Lemma S.4.1 


We prove it in three stages. In Stage 1, we bound D(c, wjCj), where c = T XiaX 2 b denotes the 
unnormalized and dense update in (2.6); in Stage 2, we bound D{c,Cj) for the normalized and 
sparse update c; in Stage 3, we bound the estimation error of weight update \w — Wj\. 

Stage 1: Denote Fi := supp(aj) U supp(a), F 2 := supp(bj) U supp(b), and F 3 := supp(cj) U 
supp(c), where c = Truncate(c/||c||, s). Let F := Fi o F 2 o F 3 . Consider the following update 

-'7FXiaX2b 

c = (S.l) 

\\Ff Xi a X 2 b|| 

where Tf denote the restriction of tensor T on the three modes indexed by Fi, F 2 and T 3 . Note that 
replacing c with c in ( 2 . 6 ) of our algorithm does not affect the iteration of c due to the sparsity 
restriction of Tf and the scaling-invariant truncation operation. Therefore, in the sequel, we will 
assume that c is redefined as c , i.e., c is redefined as 7 f Xi a X 2 b, and then bound D(c,WjCj). 

For two vectors u, v G the definition of D(u, v) can be reformulated as D(u,v) = 
sup 2 _Lv(z'''u)/(||z|| ||u||). When ||u|| = ||v|| = 1, this reformulation reduces to our original def¬ 
inition. We introduce this reformulation to measure the distance of non-unit vectors later on. Let 
z* _L SLj, zl _L hj denote the vectors achieving supremum value in the previous formulation of 
D(a, aj) and D(b, bj). Assume ||z*|| = ||z^|| = 1 . We can decompose a and b as 

a = (aj,a)aj+D(a,aj)z;, 
b = {hj,h)hj + D{h,hj)zl 

Consider any Zc T Cj with ||zc|| = 1, let Zc ;= Truncate(zc, F 3 ), we have 

(zc,c) = (zc,c) = 7 f xi a X 2 b X 3 Zc = Tf xi a X 2 b X 3 Zc -FTf xi a X 2 b X 3 Zc, (S.4) 


(5.2) 

(5.3) 


where the hrst equality is due to supp(c) C F 3 . 

According to (S.2) and (S.3), we can decompose Tf Xi a X2 b X3 Zc as follows. 


Tf xi a X2 b X3 


Zc = (aj,a)(bj,b)TF Xi slj X 2 bj X 3 Zc -F {aj,a)D(hj,h)TF Xi aj X 2 X 3 Zc 
+ D{aj,a){hj,h)TF Xi z* X 2 bj X 3 Zc + D{aj,a)D(hj,h)TF Xi z* X 2 zl XsZc 
= II -I- /2 -I- /3 -I- h- 


Next we bound each term individually. For the matrices A, B, C defined in (2.2), we have A = I-|- 
Ja, B^B = I-|-Jb; and C'''C = I-|-Jc- Assumption 3.2 implies that max{|| Ja||oo) || JbIIoo, || JcIIoo} < 
Define c := CDiag(u)) as the unnormalized matrix, and Ja * Jb as the Hadamard product 
(entry-wise multiplication) of Ja and Jb- We have. 


h < ITf Xi aj X 2 hj X3 zd = \T Xi a^ X2 bj X3 Zc)| = |zJc\j(Ja * Jb))^| < 


where the first equality is due to Lemma S.6.1 by noting that supp(aj) C Fi, supp(bj) C F 2 , 
and supp(zc) C F 3 , the second equality is due to the fact that for any u, vGM'^,T Xiux 2 V = 
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Wi{aLi,u) {hi,v)ci = c(A'''n* B~''v) and Zc -L Cj, and the last inequality is due to Assumption 
3.2 and ||zc|| < 1. 

Let z^ := Truncate(z^,supp(bj)), we have z^ _L bj since z^ _L hj. Therefore, we have Tp Xi 
aj X2 z^ X 3 Zc = Tf Xi aj X2 z^ X3 Zc due to the fact that supp(zj) C supp(bj) C F 2 . Moreover, 
Lemma S.6.1 implies that 7f Xi a^ X2 z^ x^Zc = T Xi a^ X2 z^ X3 Zc. Therefore, we have 


h < e| 7 > Xi a^ X2 Zft X3 Zc| = elT Xi a^ X2 z^ X3 Zc| = e 

C'o'^max 


zJc\j-[(Ja)/ * (B\j) 'z^] 


\\j 


T~*i 


< e c 


\il 


(Ja) 


B 


\il 


< 




where the second inequality is because ||zc|| < 1 and ||x=t=y|| < ||x||oo • ||y|| for any two vectors x, y, 
and the last inequality is due to ||z^|| < 1 and Assumption 3.2. 

Similarly, let z* := Truncate(z*,supp(aj)), we can bound via the same strategy as in F- 
Moreover, the bound for 1^ is desirable by considering the fact ||z*|| < 1, ||z^|| < 1, ||zc|| < 1, and 
the assumption on ||T||. In particular, we can show that 


h < elTp xi z* X2 bj X3 Zc| = e|T Xi z* X2 hj X3 z^ < + C2\j^^ e, 

li < e^lTp Xi z* X 2 z^ X3Zc| = e^|T Xi z* X 2 z^ X3Zc| < e^||T|| < C^Wms^e^- 


After we bound all four terms IuTf XiaX 2b x^Zc, our next step is to bound the error term 
Sp Xi a X2 b X3 Zc. Then the whole error rate of (zc,c) can be derived based on (S.d). Since 
||S||o < s, ||b||o < s and ||zc||o < IT3I < do + s, we have 


ITij'Xi a X2 b X3 Zc| < 


(Si? Xi a X2 b X3 



£ Xi a X2 b X3 



< r]{£,do + s). 


where the first inequality is due to ||zc|| < 1, and the first equality is due to Lemma S.6.1. 

Combining all above bounds, we have, (zc,c) < Wraa.^f{^\K,do) + r]{£,do + s). In order to 
compute the upper bound of the term D(c,WjCj), by definition, the rest part is to quantify the 
lower bound of ||c||. By definition, c = Tp XiaX 2b + iS’i? XiaX 2b. According to Lemma S.6.2, 
for F = Fi o F 2 o F 3 , we have 

TFXiaX 2b = r(;j(Truncate(aj, Fi), a)(Truncate(bi, T2), b)Truncate(cj, F3) 

ie[A] 

= ^ u;i(aj,a)(bi,b)Truncate(ci,F3) 

is [A] 

= Wj{a.j,a.){hj,h)cj + Wjjaj, a)(bj, b)Truncate(ct, F3), 

i¥=j 


where the second equality is due to supp(a) C Fi, supp(b) C F2, and the last equality is from 
supp(cj) C F3. Therefore, we have 


c|| > ||u)j(aj,a)(bj,b)cj 


^ rui (a*, a) (bj, b)Truncate(c,, F3) 


\\£p Xi a X2 b 


(S.5) 
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Since (aj,a) = y^l — D{a.j,a) and (bj,b) = yl — D{hj,h), we have the first term is bounded by 
||t(;j(aj,a)(bj,b)cj|| > Wj{l — e^). In addition, according to (5'.2) and {S.3), we have 

II 'y^^Wi{aLi,a){hi, b)Truncate(ci, F 3 ) 
i¥=j 


- \\'^Wi{ai,aj){hi,hj)ci + e ^ t(;i(aj, z*)(bi, bj)Truncate(ci, F 3 ) 


i¥=j 


i¥=j 


+ e y^^Wi{ai,aj){hi, z^)Truncate(c^, F 3 ) + Wjjaj, z*)(bj, z^)Truncate(cj, F 3 ) 








where the last inequality is due to the similar strategy when we bound (zc,c). Furthermore, 
according to Lemma S.6.3, we have \\£f Xi a X2 b|| < H^fII < + s). Plugging these 

bounds into (S'.S) implies that ||c|| > 'Wj{l — e^) — tC ma x/(e; K, do) — r]{£, do + s). Therefore, when 
Wj{l - e^) - Wma.xf{€] K, do) - TliS, do + s) > 0, we have 

^ ^ 'Wmax/(e; K, dp) + r]{£, dp + s) 

~ Wj{l - e^) - u;max/(e; K, dp) - r]{£, do + s)' 

Stage 2: We next bound D{c,Cj) for the normalized and sparse update c. According to 
the normalization-invariant property of D{u,v) = sup2_Lv(z'''u)/(||z||||u||), we have D{c,Cj) = 
D(c, WjCj), where c = c/||c||. In addition, denote Fc as the indices of c with the largest s absolute 
values, by Lemma S.6.4, we have 


|Truncate(c, Fc)'''cj I > |c'''cj 



(S. 6 ) 


where the right-hand side is an increasing function in |c''^Cj| for |c''~Cj| G [0,1]. According to 
our algorithm, c = Truncate(c, Fc)/||Truncate(c, Fc)||. Note that ||Truncate(c, Fc)|| < 1 since 
||c|| = 1 and the set of Fc only keeps part of the entries of c. Therefore, we have D{c,Cj) < 

yl — (Truncate(c, Combining this with {S. 6 ) implies that 


D{c,Cj) < 




D{c,Cj) < V5D{c,Cj) 


V^Wmax/(e; K, dp) + V5r]{£, dp + s) 

Wj{l - e2) - tCmax/Ce; K, dp) - r]{£, dp + s)' 


where the second inequality is due to dp < s. 

Stage 3: Finally, we bound \w — vjj\. Specifically, decomposing w leads to 


(5.7) 

(5.8) 


w = Wj{aj,a){hj,h){cj,c) -|- uij (a^, a) (b^ , b)(Truncate (cj, F 3 ); c) -|- £iir(a, b, c). 
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Using similar strategy as in Stage 1, we can show that 


\wj — w\ < 


Wj(l - {aj,a){hj,h){cj,c)'^ + | ^ a)(bi, b)(Truncate(cj, F3), c)| + |£:i;’(a,b,c)| 

< 2wje^ + tCmax/Ce; K, do) + r]{£, do + s). 


Combining Stages 1-3 leads to the desirable error rates of decomposition component and the weight 
estimation. This ends the proof of Lemma S.4.1. ■ 


S.5.2 Proof of Corollary 3.8 

Since the only difference between Corollary 3.8 and Theorem 3.6 is change the truncation to the 
soft-thresholding, we only need to show a similar result as (S.8) and all the remaining steps are same 
as the proof of Theorem 3.6. Let Sc is the support of the Cj, then ||c5^ —Cj5^||oo < D{c,Cj)/^/do < P- 
Therefore, if /c G Sc, we have \cjk\ > 2p and |cfc| > \cjk\ — ||c5^ — Cj^^Hoo > p and S{ci^,p) > 0. We 
can further derive that 

|S'(c,/9)'''cj| > |c'''cj| — \S{c,p)~^Cj — c'''cj| > |c'''cj| — \/dop and 


D{S{c,p),Cj) < ^/dop + D{c,Cj) < 


2wmax/(e; K, dp) + p{S, 2do) 

Wj{l - e2) - Wraa.^f{e-,K,do) - r]{S,2do)' 


The remaining part of the proof is same as Theorem 3.6. 


S.5.3 Proof of Lemma S.4.2 


The proof idea is outlined as follows. Remind that T Xp 6 = T X 3 6 + S Xp 0 is a multilinear 
combination of the tensor slices. Based on (2.2), we have T X 3 0 = T.ielK] ^ 

when di = d 2 = dp = d. Intuitively, we can treat Wi{cJ 6 ) as the singular value, and aj,bj as the 
left and right singular vectors. Although this is not an exact singular value decomposition since 
the spaces of [ai,..., ax] and [bi,..., bx] are not orthogonal, we can show that this sparse SVD 
algorithm eventually generates good initializations if we repeat this procedure many times. 

For the vector 0 ~ A^(0,1,^) in Algorithm 3, let ui and vi be the leading left and right singular 
vectors of T XpO. Let A := Diag(u;)C''~0 G M^, and Ai := maxj |Ai| and A2 := maxj^i |Aj|. Moreover, 
denote = Diag(tc)C''^0T-. Recall that C is defined in Assumption 3.2. Adapted from Lemma 8 
in Anandkumar et al. (2014c), we have, with high probability, 

A^-^ ^ > WrcmP(L) and ^ < 4r(;max(l + rj) (S.9) 


Let the set of support F := {supp(ai)Usupp(ui)}o{supp(bi)Usupp(vi)}o{supp(ci)Usupp(0)}, 
We can show that the algorithm has the same output if we replace F X 36 with TpXpO in Algorithm 
3. Decomposing Tp — TpXpO + Sp X3 0, and adapting Lemmas 8-9 in Anandkumar et al. 
(2014c) on F, we have 


max{D(ui,ai),D(vi,bi)} < 


Pm\nX 2 T p X3 0 

^Ai - \\£p X3 0|| 
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Let 6 = 0/||0||, then we have \\£p X3 0|| < ||0 ||\\£p X3 0|| < ||0|| ||<?^|| < ||0||?y(<S, do + s) where 
the second inequality is due to ||0|| = 1. Next we bound ||0||. Note that the vector 0 consists 
of do i.i.d. standard normals and the rest are all zeros. According to Lemma S.6.5, we have 
.P[||^|| > ^ g-(“o-i)^<io/2_ Therefore, 


P 


\\£p X3 0|| < ao\/sr/(£:, do + s) 


> 1 


_ g-(ao-l)^cio/2 


This together with (5.9) leads to the desirable result, that is, if 

'U^min C^max(l T /i) 

with // < u>min/(C^max) ” 1 and some 0 <]1 < 1, then with high probability. 


max ■ 


:{z)(a5.°Ui),^(b5*\bi)} 

This ends the proof of Lemma S.4.2. 


< 


4rCmaxMmin(l + C)Vlog-^ + ao^/sT]{£,do + s) 
WmmJigiL) - ao do + s) 


S.6 Auxiliary Lemmas 

The following auxiliary lemmas S.6.1-S.6.4 are useful to show the general contraction result in one 
iteration, and Lemma S.6.5 on the tail bound for chi-squared variable is useful when we show the 
error bound of the sparse SVD-based initialization. 

Lemma S. 6 . 1 . For any tensor T G and an index set F = Fio F20 with F) C {1,..., d}, 

for any vectors x, y, z G M'^, if supp(x) C Fi, supp(y) C F 2 , and supp(z) C F3, we have 

7>Xixx2yx3Z = T Xixx 2 yx 3Z. 

Proof of Lemma S. 6 . 1 : By definition, we get 7> Xi x X2 y X3 z = 

Since [TF]i,j,k 7^ 0 only when i G Fi,j G Fi and k G F3, we have Tf Xi x X2 y X3 z = 
'i 2 ieFi,jeF 2 ,keF 3 ^iyj'^k[T]ij,k- Due to the assumption that supp(x) C Fi, supp(y) C F 2 , and 
supp(z) C F3, we get the desirable result, 

7>XixX2yX3Z= ^ XiyjZfc[7lj,j,fc = XiyjZk[T]ij,k = T Xixx 2 y X 3 Z. 

i&Fi,j(^F 2 ,kGF 3 i,j,k&[d] 

Lemma S.6. 2 . For any tensor T G and an index set F = Fio F20 F3 with F C {1,..., d}, 

if T = Wiai o bj o a, we have 

Tf = r/;iTruncate(aj, Fi) o Truncate(bj, F 2 ) o Truncate(cj, F3). 
ielK] 

This is a direct application of the sparsity of 7 f and hence the proof is omitted. 

Lemma S.6. 3. For any tensor T G M'^xdxd g^y vectors x, y G with ||x|| = ||y|| = 1, we 

have \\T Xi X X2 y II < lirii. 
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Proof of Lemma S.6.3: By definition of tensor norm and the property ||x|| = ||y|| = 1, we have 
||T|| = sup \T xi u X 2 V X 3 w| > sup \T xi X X 2 y X 3 w| 

||u|| = ||v|| = ||w||=l |w||=l 

= sup |w'''T xi X X2 y| = ||T xi X X2 y||, 

||w||=l 

where the last equality is due to the fact that for any vector z G ||^||||^|| < 1. 

Lemma S.6.4. (Yuan and Zhang, 2013, Lemma 12) Consider a sparse vector x with supp(x) = F-^ 

and |Fx| = do- Let Fy = supp(y, s). If ||x|| = ||y|| = 1, then 

|Truncate(y,Fy)’^x| > |y’^x| - y^min |^1 - (y'^x)^, (^1 + Y^)(l - (y^x)^) 

Lemma S.6.5. (Laurent and Massart, 2000, Lemma 1) For a Chi-squared variable Y ~ X^(d), we 

have, for all x > 0, P[Y — d> 2\fdx -|- 2x] < e~^. 
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